Prosper Loan Exploration

by Ahmed Gharib

Introduction

About Prosper

From the Prosper website. Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than $17 billion in loans to more than 1,030,000 people.

Through Prosper, people can invest in each other in a way that is financially and socially rewarding. Borrowers apply online for a fixed-rate, fixed-term loan between \$2,000 and \\$40,000. Individuals and institutions can invest in the loans and earn attractive returns. Prosper handles all loan servicing on behalf of the matched borrowers and investors.

Prosper Marketplace is backed by leading investors including Sequoia Capital, Francisco Partners, Institutional Venture Partners, and Credit Suisse NEXT Fund.

Preliminary Wrangling

Prosper Loan Dataset (Last update 03/11/2014)

  • This dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

  • This data dictionary explains the variables in the data set.

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# Setting up the general theme of charts and color palette to use
sns.set_theme(style='white', palette='Set2')
base_color = '#00334e'

# To display charts in the same notebook
%matplotlib inline

# Pandas diplay options
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.float_format', lambda x: '%.5f' % x)
In [3]:
# First let's start by reading the data dictionary to understand each column of the data 
data_dict = pd.read_excel('data/Prosper Loan Data - Variable Definitions.xlsx')
data_dict
Out[3]:
Variable Description
0 ListingKey Unique key for each listing, same value as the...
1 ListingNumber The number that uniquely identifies the listin...
2 ListingCreationDate The date the listing was created.
3 CreditGrade The Credit rating that was assigned at the tim...
4 Term The length of the loan expressed in months.
5 LoanStatus The current status of the loan: Cancelled, Ch...
6 ClosedDate Closed date is applicable for Cancelled, Compl...
7 BorrowerAPR The Borrower's Annual Percentage Rate (APR) fo...
8 BorrowerRate The Borrower's interest rate for this loan.
9 LenderYield The Lender yield on the loan. Lender yield is ...
10 EstimatedEffectiveYield Effective yield is equal to the borrower inter...
11 EstimatedLoss Estimated loss is the estimated principal loss...
12 EstimatedReturn The estimated return assigned to the listing a...
13 ProsperRating (numeric) The Prosper Rating assigned at the time the l...
14 ProsperRating (Alpha) The Prosper Rating assigned at the time the li...
15 ProsperScore A custom risk score built using historical Pro...
16 ListingCategory The category of the listing that the borrower ...
17 BorrowerState The two letter abbreviation of the state of th...
18 Occupation The Occupation selected by the Borrower at the...
19 EmploymentStatus The employment status of the borrower at the t...
20 EmploymentStatusDuration The length in months of the employment status ...
21 IsBorrowerHomeowner A Borrower will be classified as a homowner if...
22 CurrentlyInGroup Specifies whether or not the Borrower was in a...
23 GroupKey The Key of the group in which the Borrower is ...
24 DateCreditPulled The date the credit profile was pulled.
25 CreditScoreRangeLower The lower value representing the range of the ...
26 CreditScoreRangeUpper The upper value representing the range of the ...
27 FirstRecordedCreditLine The date the first credit line was opened.
28 CurrentCreditLines Number of current credit lines at the time the...
29 OpenCreditLines Number of open credit lines at the time the cr...
30 TotalCreditLinespast7years Number of credit lines in the past seven years...
31 OpenRevolvingAccounts Number of open revolving accounts at the time ...
32 OpenRevolvingMonthlyPayment Monthly payment on revolving accounts at the t...
33 InquiriesLast6Months Number of inquiries in the past six months at ...
34 TotalInquiries Total number of inquiries at the time the cred...
35 CurrentDelinquencies Number of accounts delinquent at the time the ...
36 AmountDelinquent Dollars delinquent at the time the credit prof...
37 DelinquenciesLast7Years Number of delinquencies in the past 7 years at...
38 PublicRecordsLast10Years Number of public records in the past 10 years ...
39 PublicRecordsLast12Months Number of public records in the past 12 months...
40 RevolvingCreditBalance Dollars of revolving credit at the time the cr...
41 BankcardUtilization The percentage of available revolving credit t...
42 AvailableBankcardCredit The total available credit via bank card at th...
43 TotalTrades Number of trade lines ever opened at the time ...
44 TradesNeverDelinquent Number of trades that have never been delinque...
45 TradesOpenedLast6Months Number of trades opened in the last 6 months a...
46 DebtToIncomeRatio The debt to income ratio of the borrower at th...
47 IncomeRange The income range of the borrower at the time t...
48 IncomeVerifiable The borrower indicated they have the required ...
49 StatedMonthlyIncome The monthly income the borrower stated at the ...
50 LoanKey Unique key for each loan. This is the same key...
51 TotalProsperLoans Number of Prosper loans the borrower at the ti...
52 TotalProsperPaymentsBilled Number of on time payments the borrower made o...
53 OnTimeProsperPayments Number of on time payments the borrower had ma...
54 ProsperPaymentsLessThanOneMonthLate Number of payments the borrower made on Prospe...
55 ProsperPaymentsOneMonthPlusLate Number of payments the borrower made on Prospe...
56 ProsperPrincipalBorrowed Total principal borrowed on Prosper loans at t...
57 ProsperPrincipalOutstanding Principal outstanding on Prosper loans at the ...
58 ScorexChangeAtTimeOfListing Borrower's credit score change at the time the...
59 LoanCurrentDaysDelinquent The number of days delinquent.
60 LoanFirstDefaultedCycleNumber The cycle the loan was charged off. If the loa...
61 LoanMonthsSinceOrigination Number of months since the loan originated.
62 LoanNumber Unique numeric value associated with the loan.
63 LoanOriginalAmount The origination amount of the loan.
64 LoanOriginationDate The date the loan was originated.
65 LoanOriginationQuarter The quarter in which the loan was originated.
66 MemberKey The unique key that is associated with the bor...
67 MonthlyLoanPayment The scheduled monthly loan payment.
68 LP_CustomerPayments Pre charge-off cumulative gross payments made ...
69 LP_CustomerPrincipalPayments Pre charge-off cumulative principal payments m...
70 LP_InterestandFees Pre charge-off cumulative interest and fees pa...
71 LP_ServiceFees Cumulative service fees paid by the investors ...
72 LP_CollectionFees Cumulative collection fees paid by the investo...
73 LP_GrossPrincipalLoss The gross charged off amount of the loan.
74 LP_NetPrincipalLoss The principal that remains uncollected after a...
75 LP_NonPrincipalRecoverypayments The interest and fee component of any recovery...
76 PercentFunded Percent the listing was funded.
77 Recommendations Number of recommendations the borrower had at ...
78 InvestmentFromFriendsCount Number of friends that made an investment in t...
79 InvestmentFromFriendsAmount Dollar amount of investments that were made by...
80 Investors The number of investors that funded the loan.
In [4]:
# Loading the dataset and parsing the dates columns
date_columns = ['ListingCreationDate', 'ClosedDate', 'DateCreditPulled', 'FirstRecordedCreditLine', 'LoanOriginationDate']
loan = pd.read_csv('data/prosperLoanData.csv', parse_dates=date_columns)
In [5]:
# Display the shape of the data
loan.shape
Out[5]:
(113937, 81)
In [6]:
# Display dataset summary statistics
loan.describe().T
Out[6]:
count mean std min 25% 50% 75% max
ListingNumber 113937.00000 627885.69251 328076.23639 4.00000 400919.00000 600554.00000 892634.00000 1255725.00000
Term 113937.00000 40.83025 10.43621 12.00000 36.00000 36.00000 36.00000 60.00000
BorrowerAPR 113912.00000 0.21883 0.08036 0.00653 0.15629 0.20976 0.28381 0.51229
BorrowerRate 113937.00000 0.19276 0.07482 0.00000 0.13400 0.18400 0.25000 0.49750
LenderYield 113937.00000 0.18270 0.07452 -0.01000 0.12420 0.17300 0.24000 0.49250
EstimatedEffectiveYield 84853.00000 0.16866 0.06847 -0.18270 0.11567 0.16150 0.22430 0.31990
EstimatedLoss 84853.00000 0.08031 0.04676 0.00490 0.04240 0.07240 0.11200 0.36600
EstimatedReturn 84853.00000 0.09607 0.03040 -0.18270 0.07408 0.09170 0.11660 0.28370
ProsperRating (numeric) 84853.00000 4.07224 1.67323 1.00000 3.00000 4.00000 5.00000 7.00000
ProsperScore 84853.00000 5.95007 2.37650 1.00000 4.00000 6.00000 8.00000 11.00000
ListingCategory (numeric) 113937.00000 2.77421 3.99680 0.00000 1.00000 1.00000 3.00000 20.00000
EmploymentStatusDuration 106312.00000 96.07158 94.48061 0.00000 26.00000 67.00000 137.00000 755.00000
CreditScoreRangeLower 113346.00000 685.56773 66.45827 0.00000 660.00000 680.00000 720.00000 880.00000
CreditScoreRangeUpper 113346.00000 704.56773 66.45827 19.00000 679.00000 699.00000 739.00000 899.00000
CurrentCreditLines 106333.00000 10.31719 5.45787 0.00000 7.00000 10.00000 13.00000 59.00000
OpenCreditLines 106333.00000 9.26016 5.02264 0.00000 6.00000 9.00000 12.00000 54.00000
TotalCreditLinespast7years 113240.00000 26.75454 13.63787 2.00000 17.00000 25.00000 35.00000 136.00000
OpenRevolvingAccounts 113937.00000 6.96979 4.63097 0.00000 4.00000 6.00000 9.00000 51.00000
OpenRevolvingMonthlyPayment 113937.00000 398.29216 447.15971 0.00000 114.00000 271.00000 525.00000 14985.00000
InquiriesLast6Months 113240.00000 1.43508 2.43751 0.00000 0.00000 1.00000 2.00000 105.00000
TotalInquiries 112778.00000 5.58440 6.42995 0.00000 2.00000 4.00000 7.00000 379.00000
CurrentDelinquencies 113240.00000 0.59205 1.97871 0.00000 0.00000 0.00000 0.00000 83.00000
AmountDelinquent 106315.00000 984.50706 7158.27016 0.00000 0.00000 0.00000 0.00000 463881.00000
DelinquenciesLast7Years 112947.00000 4.15498 10.16022 0.00000 0.00000 0.00000 3.00000 99.00000
PublicRecordsLast10Years 113240.00000 0.31265 0.72787 0.00000 0.00000 0.00000 0.00000 38.00000
PublicRecordsLast12Months 106333.00000 0.01509 0.15409 0.00000 0.00000 0.00000 0.00000 20.00000
RevolvingCreditBalance 106333.00000 17598.70675 32936.40179 0.00000 3121.00000 8549.00000 19521.00000 1435667.00000
BankcardUtilization 106333.00000 0.56131 0.31792 0.00000 0.31000 0.60000 0.84000 5.95000
AvailableBankcardCredit 106393.00000 11210.22545 19818.36131 0.00000 880.00000 4100.00000 13180.00000 646285.00000
TotalTrades 106393.00000 23.23003 11.87131 0.00000 15.00000 22.00000 30.00000 126.00000
TradesNeverDelinquent (percentage) 106393.00000 0.88590 0.14818 0.00000 0.82000 0.94000 1.00000 1.00000
TradesOpenedLast6Months 106393.00000 0.80233 1.09764 0.00000 0.00000 0.00000 1.00000 20.00000
DebtToIncomeRatio 105383.00000 0.27595 0.55176 0.00000 0.14000 0.22000 0.32000 10.01000
StatedMonthlyIncome 113937.00000 5608.02557 7478.49728 0.00000 3200.33333 4666.66667 6825.00000 1750002.91667
TotalProsperLoans 22085.00000 1.42110 0.76404 0.00000 1.00000 1.00000 2.00000 8.00000
TotalProsperPaymentsBilled 22085.00000 22.93434 19.24958 0.00000 9.00000 16.00000 33.00000 141.00000
OnTimeProsperPayments 22085.00000 22.27195 18.83042 0.00000 9.00000 15.00000 32.00000 141.00000
ProsperPaymentsLessThanOneMonthLate 22085.00000 0.61363 2.44683 0.00000 0.00000 0.00000 0.00000 42.00000
ProsperPaymentsOneMonthPlusLate 22085.00000 0.04854 0.55629 0.00000 0.00000 0.00000 0.00000 21.00000
ProsperPrincipalBorrowed 22085.00000 8472.31196 7395.50765 0.00000 3500.00000 6000.00000 11000.00000 72499.00000
ProsperPrincipalOutstanding 22085.00000 2930.31391 3806.63508 0.00000 0.00000 1626.55000 4126.72000 23450.95000
ScorexChangeAtTimeOfListing 18928.00000 -3.22321 50.06357 -209.00000 -35.00000 -3.00000 25.00000 286.00000
LoanCurrentDaysDelinquent 113937.00000 152.81654 466.32025 0.00000 0.00000 0.00000 0.00000 2704.00000
LoanFirstDefaultedCycleNumber 16952.00000 16.26846 9.00590 0.00000 9.00000 14.00000 22.00000 44.00000
LoanMonthsSinceOrigination 113937.00000 31.89688 29.97418 0.00000 6.00000 21.00000 65.00000 100.00000
LoanNumber 113937.00000 69444.47427 38930.47961 1.00000 37332.00000 68599.00000 101901.00000 136486.00000
LoanOriginalAmount 113937.00000 8337.01385 6245.80058 1000.00000 4000.00000 6500.00000 12000.00000 35000.00000
MonthlyLoanPayment 113937.00000 272.47578 192.69781 0.00000 131.62000 217.74000 371.58000 2251.51000
LP_CustomerPayments 113937.00000 4183.07949 4790.90723 -2.34990 1005.76000 2583.83000 5548.40000 40702.39000
LP_CustomerPrincipalPayments 113937.00000 3105.53659 4069.52767 0.00000 500.89000 1587.50000 4000.00000 35000.00000
LP_InterestandFees 113937.00000 1077.54290 1183.41417 -2.34990 274.87000 700.84010 1458.54000 15617.03000
LP_ServiceFees 113937.00000 -54.72564 60.67542 -664.87000 -73.18000 -34.44000 -13.92000 32.06000
LP_CollectionFees 113937.00000 -14.24270 109.23276 -9274.75000 0.00000 0.00000 0.00000 0.00000
LP_GrossPrincipalLoss 113937.00000 700.44634 2388.51383 -94.20000 0.00000 0.00000 0.00000 25000.00000
LP_NetPrincipalLoss 113937.00000 681.42050 2357.16707 -954.55000 0.00000 0.00000 0.00000 25000.00000
LP_NonPrincipalRecoverypayments 113937.00000 25.14269 275.65794 0.00000 0.00000 0.00000 0.00000 21117.90000
PercentFunded 113937.00000 0.99858 0.01792 0.70000 1.00000 1.00000 1.00000 1.01250
Recommendations 113937.00000 0.04803 0.33235 0.00000 0.00000 0.00000 0.00000 39.00000
InvestmentFromFriendsCount 113937.00000 0.02346 0.23241 0.00000 0.00000 0.00000 0.00000 33.00000
InvestmentFromFriendsAmount 113937.00000 16.55075 294.54542 0.00000 0.00000 0.00000 0.00000 25000.00000
Investors 113937.00000 80.47523 103.23902 1.00000 2.00000 44.00000 115.00000 1189.00000
In [7]:
# Disply df info
loan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype         
---  ------                               --------------   -----         
 0   ListingKey                           113937 non-null  object        
 1   ListingNumber                        113937 non-null  int64         
 2   ListingCreationDate                  113937 non-null  datetime64[ns]
 3   CreditGrade                          28953 non-null   object        
 4   Term                                 113937 non-null  int64         
 5   LoanStatus                           113937 non-null  object        
 6   ClosedDate                           55089 non-null   datetime64[ns]
 7   BorrowerAPR                          113912 non-null  float64       
 8   BorrowerRate                         113937 non-null  float64       
 9   LenderYield                          113937 non-null  float64       
 10  EstimatedEffectiveYield              84853 non-null   float64       
 11  EstimatedLoss                        84853 non-null   float64       
 12  EstimatedReturn                      84853 non-null   float64       
 13  ProsperRating (numeric)              84853 non-null   float64       
 14  ProsperRating (Alpha)                84853 non-null   object        
 15  ProsperScore                         84853 non-null   float64       
 16  ListingCategory (numeric)            113937 non-null  int64         
 17  BorrowerState                        108422 non-null  object        
 18  Occupation                           110349 non-null  object        
 19  EmploymentStatus                     111682 non-null  object        
 20  EmploymentStatusDuration             106312 non-null  float64       
 21  IsBorrowerHomeowner                  113937 non-null  bool          
 22  CurrentlyInGroup                     113937 non-null  bool          
 23  GroupKey                             13341 non-null   object        
 24  DateCreditPulled                     113937 non-null  datetime64[ns]
 25  CreditScoreRangeLower                113346 non-null  float64       
 26  CreditScoreRangeUpper                113346 non-null  float64       
 27  FirstRecordedCreditLine              113240 non-null  datetime64[ns]
 28  CurrentCreditLines                   106333 non-null  float64       
 29  OpenCreditLines                      106333 non-null  float64       
 30  TotalCreditLinespast7years           113240 non-null  float64       
 31  OpenRevolvingAccounts                113937 non-null  int64         
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64       
 33  InquiriesLast6Months                 113240 non-null  float64       
 34  TotalInquiries                       112778 non-null  float64       
 35  CurrentDelinquencies                 113240 non-null  float64       
 36  AmountDelinquent                     106315 non-null  float64       
 37  DelinquenciesLast7Years              112947 non-null  float64       
 38  PublicRecordsLast10Years             113240 non-null  float64       
 39  PublicRecordsLast12Months            106333 non-null  float64       
 40  RevolvingCreditBalance               106333 non-null  float64       
 41  BankcardUtilization                  106333 non-null  float64       
 42  AvailableBankcardCredit              106393 non-null  float64       
 43  TotalTrades                          106393 non-null  float64       
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64       
 45  TradesOpenedLast6Months              106393 non-null  float64       
 46  DebtToIncomeRatio                    105383 non-null  float64       
 47  IncomeRange                          113937 non-null  object        
 48  IncomeVerifiable                     113937 non-null  bool          
 49  StatedMonthlyIncome                  113937 non-null  float64       
 50  LoanKey                              113937 non-null  object        
 51  TotalProsperLoans                    22085 non-null   float64       
 52  TotalProsperPaymentsBilled           22085 non-null   float64       
 53  OnTimeProsperPayments                22085 non-null   float64       
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64       
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64       
 56  ProsperPrincipalBorrowed             22085 non-null   float64       
 57  ProsperPrincipalOutstanding          22085 non-null   float64       
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64       
 59  LoanCurrentDaysDelinquent            113937 non-null  int64         
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64       
 61  LoanMonthsSinceOrigination           113937 non-null  int64         
 62  LoanNumber                           113937 non-null  int64         
 63  LoanOriginalAmount                   113937 non-null  int64         
 64  LoanOriginationDate                  113937 non-null  datetime64[ns]
 65  LoanOriginationQuarter               113937 non-null  object        
 66  MemberKey                            113937 non-null  object        
 67  MonthlyLoanPayment                   113937 non-null  float64       
 68  LP_CustomerPayments                  113937 non-null  float64       
 69  LP_CustomerPrincipalPayments         113937 non-null  float64       
 70  LP_InterestandFees                   113937 non-null  float64       
 71  LP_ServiceFees                       113937 non-null  float64       
 72  LP_CollectionFees                    113937 non-null  float64       
 73  LP_GrossPrincipalLoss                113937 non-null  float64       
 74  LP_NetPrincipalLoss                  113937 non-null  float64       
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64       
 76  PercentFunded                        113937 non-null  float64       
 77  Recommendations                      113937 non-null  int64         
 78  InvestmentFromFriendsCount           113937 non-null  int64         
 79  InvestmentFromFriendsAmount          113937 non-null  float64       
 80  Investors                            113937 non-null  int64         
dtypes: bool(3), datetime64[ns](5), float64(50), int64(11), object(12)
memory usage: 68.1+ MB
In [8]:
# Display the number of unique values for the string columns
loan.select_dtypes(include=['object']).nunique()
Out[8]:
ListingKey                113066
CreditGrade                    8
LoanStatus                    12
ProsperRating (Alpha)          7
BorrowerState                 51
Occupation                    67
EmploymentStatus               8
GroupKey                     706
IncomeRange                    8
LoanKey                   113066
LoanOriginationQuarter        33
MemberKey                  90831
dtype: int64
In [9]:
# Display any duplicated values if any and the number of duplicates
print(f'number of duplicated rows: {loan[loan.duplicated()].shape[0]}')
loan[loan.duplicated()]
number of duplicated rows: 0
Out[9]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
In [10]:
# Number of loans for every year
loan.ListingCreationDate.dt.year.value_counts()
Out[10]:
2013    35413
2012    19556
2007    11557
2011    11442
2008    11263
2014    10734
2006     6213
2010     5530
2009     2206
2005       23
Name: ListingCreationDate, dtype: int64
In [11]:
# Display the IncomeRange Value Counts
loan.IncomeRange.value_counts()
Out[11]:
$25,000-49,999    32192
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7741
$1-24,999          7274
Not employed        806
$0                  621
Name: IncomeRange, dtype: int64
In [12]:
# Display Term Value Counts
loan.Term.value_counts()
Out[12]:
36    87778
60    24545
12     1614
Name: Term, dtype: int64
In [13]:
# Display EmployementStatus Value Counts
loan.EmploymentStatus.value_counts()
Out[13]:
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
In [14]:
# Display ProsperScore Value Counts
loan.ProsperScore.value_counts().sort_index()
Out[14]:
1.00000       992
2.00000      5766
3.00000      7642
4.00000     12595
5.00000      9813
6.00000     12278
7.00000     10597
8.00000     12053
9.00000      6911
10.00000     4750
11.00000     1456
Name: ProsperScore, dtype: int64
In [15]:
# Display a sample from where ProsperScore 11
loan[loan.ProsperScore == 11].sample(10)
Out[15]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
38736 9059359744187442813B32C 1053334 2013-12-17 16:38:08.783 NaN 36 Current NaT 0.07620 0.06290 0.05290 0.05221 0.00990 0.04231 7.00000 AA 11.00000 7 MN Computer Programmer Employed 186.00000 True False NaN 2013-12-17 16:38:11 780.00000 799.00000 1996-09-21 12.00000 10.00000 28.00000 6 100.00000 0.00000 3.00000 0.00000 0.00000 0.00000 0.00000 0.00000 277.00000 0.01000 14524.00000 24.00000 0.87000 0.00000 0.14000 $75,000-99,999 True 7500.00000 C5F93701508613307EB215D nan nan nan nan nan nan nan nan 0 nan 3 123647 20000 2013-12-27 Q4 2013 DB603701951974900E95199 611.07000 1222.14000 1011.15000 210.99000 -33.55000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 178
56471 A7283592348294721E2590A 942575 2013-10-08 09:49:54.087 NaN 36 Current NaT 0.14816 0.11990 0.10990 0.10568 0.03490 0.07078 6.00000 A 11.00000 1 FL Executive Employed 16.00000 False False NaN 2013-10-08 09:49:57 660.00000 679.00000 1995-06-22 11.00000 9.00000 52.00000 11 220.00000 0.00000 10.00000 0.00000 0.00000 9.00000 1.00000 0.00000 2278.00000 0.22000 7922.00000 46.00000 0.78000 0.00000 0.11000 $75,000-99,999 True 7500.00000 FB89369626146178404077E nan nan nan nan nan nan nan nan 0 nan 5 105767 4000 2013-10-15 Q4 2013 5B613696044039930C62A70 132.84000 530.04600 375.41000 154.63600 -13.01000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
66892 86243600669821455BE25D8 1117060 2014-01-17 07:32:41.137 NaN 36 Current NaT 0.20524 0.16850 0.15850 0.14814 0.06990 0.07824 4.00000 C 11.00000 2 FL Computer Programmer Other 22.00000 False False NaN 2014-01-17 07:32:43 660.00000 679.00000 1993-10-08 6.00000 6.00000 31.00000 2 25.00000 0.00000 14.00000 1.00000 2530.00000 0.00000 0.00000 0.00000 1508.00000 0.00000 0.00000 18.00000 0.84000 0.00000 0.43000 $25,000-49,999 True 2100.00000 DD8D3703044881871A27C50 3.00000 52.00000 36.00000 16.00000 0.00000 12000.00000 3123.36000 nan 0 nan 2 127245 4000 2014-01-22 Q1 2014 057834814391860466E09CB 142.31000 484.15340 446.71000 37.44340 -2.34000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
30611 587A360313202632127EC5E 1241877 2014-03-03 09:00:12.487 NaN 60 Current NaT 0.15817 0.13500 0.12500 0.11858 0.04990 0.06868 5.00000 B 11.00000 1 WI Other Other 49.00000 True False NaN 2014-03-03 09:00:15 660.00000 679.00000 1977-09-01 8.00000 8.00000 24.00000 4 131.00000 0.00000 2.00000 0.00000 0.00000 10.00000 2.00000 0.00000 4205.00000 0.88000 479.00000 22.00000 0.77000 1.00000 0.23000 $50,000-74,999 True 5166.66667 1D1737082547452187141C1 2.00000 25.00000 25.00000 0.00000 0.00000 16000.00000 10122.19000 nan 0 nan 0 134583 12000 2014-03-06 Q1 2014 F9163538425180659109B0E 276.12000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
87444 518B36007989246989DEBE3 1135676 2014-01-15 14:17:52.120 NaN 36 Current NaT 0.09434 0.08090 0.07090 0.06919 0.01990 0.04929 7.00000 AA 11.00000 1 CA Sales - Commission Employed 70.00000 True False NaN 2014-01-15 14:17:54 780.00000 799.00000 1989-08-17 11.00000 11.00000 32.00000 8 624.00000 0.00000 2.00000 0.00000 0.00000 0.00000 0.00000 0.00000 24849.00000 0.37000 40551.00000 27.00000 1.00000 0.00000 0.26000 $100,000+ True 10833.33333 0129370416010133185C556 nan nan nan nan nan nan nan nan 0 nan 2 127212 25000 2014-01-22 Q1 2014 4924370298355383291E7CD 784.45000 784.45000 612.68000 171.77000 -21.23000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 366
110070 6B583602713428710F18177 1149484 2014-02-09 08:00:42.180 NaN 36 Current NaT 0.14243 0.11390 0.10390 0.09989 0.03490 0.06499 6.00000 A 11.00000 7 MI Engineer - Mechanical Employed 20.00000 False False NaN 2014-02-09 08:00:44 660.00000 679.00000 1994-07-18 10.00000 9.00000 31.00000 7 187.00000 1.00000 3.00000 1.00000 1792.00000 8.00000 2.00000 0.00000 13237.00000 0.38000 21563.00000 27.00000 0.81000 0.00000 0.18000 $75,000-99,999 True 8000.00000 7A4837053213881616C1588 1.00000 17.00000 17.00000 0.00000 0.00000 15000.00000 12266.87000 nan 0 nan 1 131079 12750 2014-02-12 Q1 2014 25CC3406594233204273873 419.78000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
37232 0F663593526719681D86F42 953128 2013-10-21 15:54:21.067 NaN 36 Current NaT 0.06726 0.06050 0.05050 0.05001 0.00740 0.04261 7.00000 AA 11.00000 2 NY Other Employed 41.00000 True False NaN 2013-10-21 15:54:24 760.00000 779.00000 1993-10-22 15.00000 15.00000 46.00000 5 167.00000 1.00000 3.00000 0.00000 0.00000 0.00000 0.00000 0.00000 3650.00000 0.32000 7450.00000 26.00000 1.00000 0.00000 0.25000 $50,000-74,999 True 4833.33333 1BB03696838165732F33561 3.00000 80.00000 38.00000 42.00000 0.00000 10158.00000 5076.35000 nan 0 nan 4 108896 15000 2013-11-05 Q4 2013 A79E3382252261518835F53 456.67000 1370.01000 1147.16000 222.85000 -36.84000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 269
77993 E77D3600560127461571692 1124923 2014-01-21 12:46:38.837 NaN 60 Current NaT 0.16304 0.14000 0.13000 0.12334 0.04990 0.07344 5.00000 B 11.00000 18 CA Computer Programmer Employed 75.00000 False False NaN 2014-01-21 12:46:41 660.00000 679.00000 1998-09-11 16.00000 14.00000 26.00000 9 815.00000 0.00000 7.00000 0.00000 0.00000 0.00000 0.00000 0.00000 30657.00000 0.73000 10622.00000 22.00000 1.00000 0.00000 0.22000 $100,000+ True 11870.00000 F9E2370390254133796B905 3.00000 96.00000 96.00000 0.00000 0.00000 13500.00000 896.23000 nan 0 nan 2 128590 13000 2014-01-23 Q1 2014 59513378901003572EB464D 302.49000 297.50370 147.91000 149.59370 -11.04000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
48433 6F3D3601560728935768E9D 1159450 2014-02-12 17:37:39.610 NaN 36 Current NaT 0.11548 0.08740 0.07740 0.07533 0.02240 0.05293 6.00000 A 11.00000 1 IL Computer Programmer Employed 8.00000 True False NaN 2014-02-12 17:37:42 740.00000 759.00000 1974-05-01 18.00000 17.00000 52.00000 17 690.00000 0.00000 6.00000 0.00000 0.00000 0.00000 0.00000 0.00000 25641.00000 0.20000 95029.00000 48.00000 0.97000 2.00000 0.27000 $75,000-99,999 True 7916.66667 871C3706803925720E80567 nan nan nan nan nan nan nan nan 0 nan 0 134489 25000 2014-03-04 Q1 2014 748A37056315900220C7DED 791.97000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
12747 E7B03593553258167E5CA68 981017 2013-10-30 18:13:35.553 NaN 36 Current NaT 0.06726 0.06050 0.05050 0.05001 0.00740 0.04261 7.00000 AA 11.00000 1 MN Sales - Commission Employed 84.00000 True True NaN 2013-10-30 18:13:39 760.00000 779.00000 2001-08-27 15.00000 11.00000 37.00000 7 179.00000 4.00000 7.00000 0.00000 0.00000 0.00000 0.00000 0.00000 16151.00000 0.25000 42978.00000 33.00000 1.00000 0.00000 0.22000 $100,000+ True 8750.00000 8FA136975209651139A4446 1.00000 35.00000 35.00000 0.00000 0.00000 1000.00000 0.00000 nan 0 nan 4 108909 18000 2013-11-07 Q4 2013 0ED433950885177559D10EB 548.00000 1644.00000 1376.58000 267.42000 -44.20000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 336
In [16]:
# Missing values from LoanStatus or BorrowerAPR
loan[['BorrowerAPR', 'LoanStatus']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   BorrowerAPR  113912 non-null  float64
 1   LoanStatus   113937 non-null  object 
dtypes: float64(1), object(1)
memory usage: 1.7+ MB
In [17]:
# Display the rows without APR rate
loan[loan['BorrowerAPR'].isnull()]
Out[17]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
2543 044B3365298516680DA929B 36 2005-11-28 22:42:42.557 HR 36 Completed 2006-01-06 nan 0.10000 0.09500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False 5BE63365249159793785758 2005-11-28 22:42:42.557 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.02685 Not displayed True 8333.33333 C1C4336516803221710615B nan nan nan nan nan nan nan nan 0 nan 99 19 4200 2005-12-12 Q4 2005 10983364491040266AF6111 99.40000 4227.03000 4200.00000 27.03000 -4.02000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 5
14632 E4103364750435927B6F892 67 2006-01-06 12:54:51.887 AA 36 Completed 2006-04-10 nan 0.06500 0.05500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False ACFB3365095848727895936 2006-01-06 12:54:51.887 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.09031 Not displayed True 4166.66667 191C3365353954529711DE0 nan nan nan nan nan nan nan nan 0 nan 98 24 2620 2006-01-10 Q1 2006 12C53364471219226F478E8 80.30000 2657.72000 2620.00000 37.72000 -2.59000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 5
25166 B3083364738592238DBDA63 6 2005-11-15 16:06:47.457 AA 36 Completed 2005-11-29 nan 0.04350 0.03850 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-15 16:06:47.457 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.04120 Not displayed True 9166.66667 C29A3366376042420E5D6F4 nan nan nan nan nan nan nan nan 0 nan 100 5 1000 2005-11-23 Q4 2005 92263365124033374675E8D 29.68000 1000.72000 1000.00000 0.72000 -0.42000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 3
25330 B6C33364979883097BBCD77 39 2005-11-29 13:29:16.810 AA 36 Completed 2006-03-21 nan 0.06395 0.05520 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False BA973366260114148BF757A 2005-11-29 09:42:00.007 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.07707 Not displayed True 7083.33333 A5A53366276781033D5C4FF nan nan nan nan nan nan nan nan 0 nan 99 18 1500 2005-12-08 Q4 2005 D3213365072473957A40CF8 45.90000 1519.59000 1500.00000 19.59000 -1.30000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 4
29773 3480336511078238810A782 21 2005-11-18 14:48:19.753 AA 36 Completed 2008-06-05 nan 0.06000 0.05500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False 94E9336577086235891524E 2005-11-18 14:48:19.753 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01923 Not displayed True 8333.33333 44DA3365766842096917FF8 nan nan nan nan nan nan nan nan 0 nan 99 14 3000 2005-12-05 Q4 2005 0FE0336637558007610834C 91.27000 3203.37000 3000.00000 203.37000 -16.08000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 10
39870 B6E73366056348939D8801F 24 2005-11-20 14:45:55.193 C 36 Completed 2006-10-24 nan 0.12000 0.11500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False 12D7336581480170815332C 2005-11-20 14:45:55.193 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.29069 Not displayed True 3333.33333 464E3365851398474A175A8 nan nan nan nan nan nan nan nan 0 nan 100 6 15000 2005-11-25 Q4 2005 B4183364632797286CA209B 498.21000 16445.53000 15000.00000 1445.53000 -53.96000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 6
43587 2F123364529418907A58D4C 19 2005-11-18 14:16:59.093 AA 36 Completed 2005-11-29 nan 0.08500 0.07500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-16 10:49:25.710 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01510 Not displayed True 18333.33333 051C3366339161583A81E4D nan nan nan nan nan nan nan nan 0 nan 100 10 5000 2005-11-28 Q4 2005 4C9A3364566879406D66E65 157.84000 5001.10000 5000.00000 1.10000 -1.10000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 6
62297 2F25336514614362295DA03 7 2005-11-15 18:25:56.107 E 36 Completed 2005-11-30 nan 0.17000 0.11500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False 5BE63365249159793785758 2005-11-09 00:30:04.487 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01783 Not displayed True 10000.00000 81EF336558129564212843B nan nan nan nan nan nan nan nan 0 nan 100 2 1000 2005-11-18 Q4 2005 0A8633658381202043D0226 35.65000 1003.95000 1000.00000 3.95000 -0.42000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
71280 DED033645500376547AF34E 60 2005-12-29 19:53:41.000 C 36 Completed 2008-04-21 nan 0.09000 0.08500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False AD593366266318525BDAB73 2005-12-16 16:02:52.027 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.37002 Not displayed True 7083.33333 05A133660705896823D1EF3 nan nan nan nan nan nan nan nan 0 nan 98 23 10000 2006-01-10 Q1 2006 F8E13365613650624A93FA5 318.00000 11349.06000 10000.00000 1349.06000 -74.98000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 8
71875 F50A33647261871547E3E0D 68 2006-01-06 13:05:50.890 AA 36 Completed 2006-07-11 nan 0.07000 0.06500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False 12D7336581480170815332C 2006-01-06 13:05:50.890 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.02343 Not displayed True 4166.66667 84153365247069237E66630 nan nan nan nan nan nan nan nan 0 nan 98 25 3000 2006-01-10 Q1 2006 ED983364564478611B1273E 92.63000 3093.73000 3000.00000 93.73000 -6.01000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
73945 8D5F33655433257693D8033 18 2005-11-18 14:10:50.470 HR 36 Completed 2006-01-06 nan 0.25000 0.24500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-18 14:10:50.470 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01234 Not displayed True 29166.66667 9B343364431011460DD5847 nan nan nan nan nan nan nan nan 0 nan 100 4 2500 2005-11-22 Q4 2005 DDF8336636770758171C82D 99.40000 2576.56000 2500.00000 76.56000 -2.08000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
74114 0B4133652604109810CAA3B 34 2005-11-28 16:16:35.077 AA 36 Completed 2008-12-08 nan 0.06000 0.05500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-28 16:16:35.077 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01115 Not displayed True 9458.33333 32233364725508802D1C433 nan nan nan nan nan nan nan nan 0 nan 99 16 1000 2005-12-07 Q4 2005 5DDE33646556425236DA3CB 30.42000 1095.21000 1000.00000 95.21000 -7.61000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 7
75710 F7C53365521712881842072 25 2005-11-20 20:15:47.240 C 36 Completed 2005-11-30 nan 0.10000 0.07500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False F0B53365823807576457B84 2005-11-16 11:25:26.897 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.09451 Not displayed True 6666.66667 5B013365507761133BD470D nan nan nan nan nan nan nan nan 0 nan 100 12 4000 2005-11-29 Q4 2005 A9FB3366058045381B7E12E 129.07000 4000.88000 4000.00000 0.88000 -0.88000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 3
78414 468C3366438848688B8B5B4 22 2005-11-18 16:35:48.210 AA 36 Completed 2006-01-09 nan 0.13000 0.12500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-17 15:13:19.920 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01289 Not displayed True 8333.33333 9B0C33664264358299699CB nan nan nan nan nan nan nan nan 0 nan 100 3 2000 2005-11-22 Q4 2005 BF2C3365347545468EEC1DC 67.39000 2033.90000 2000.00000 33.90000 -1.68000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 2
80369 54A833661083301426FFC90 44 2005-12-03 14:03:22.207 AA 36 Completed 2006-01-06 nan 0.06500 0.05500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False ACFB3365095848727895936 2005-11-28 10:19:33.010 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.09802 Not displayed True 4166.66667 335E3365194260894C5E804 nan nan nan nan nan nan nan nan 0 nan 99 20 2689 2005-12-15 Q4 2005 12C53364471219226F478E8 82.42000 2698.72000 2689.00000 9.72000 -1.12000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 10
81848 584D3364403680278BB6704 28 2005-11-21 19:35:46.890 D 36 Completed 2008-03-18 nan 0.13000 0.09500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False C846336482206839928D78F 2005-11-21 19:35:46.890 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.38380 Not displayed True 1083.33333 732433648623176525D594A nan nan nan nan nan nan nan nan 0 nan 100 8 4000 2005-11-28 Q4 2005 CDEF33658855538229DD0C5 88.12000 4705.19000 4000.00000 705.19000 -25.67000 -123.32290 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 4
90598 B87F3364623805558A45EAB 16 2005-11-18 13:39:03.263 B 36 Completed 2005-11-29 nan 0.12000 0.11500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-18 13:39:03.263 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01051 Not displayed True 16666.66667 979E3364988299845A2A448 nan nan nan nan nan nan nan nan 0 nan 100 9 1000 2005-11-28 Q4 2005 99443365049737429B8D4FC 33.21000 1000.33000 1000.00000 0.33000 -0.33000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 4
92516 80373365765621932269C34 48 2005-12-05 13:58:06.017 AA 36 Completed 2008-07-06 nan 0.07000 0.06500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False B8143364846229046768A83 2005-12-05 13:58:06.017 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.04569 Not displayed True 12500.00000 A1DA3366059246188A8B750 nan nan nan nan nan nan nan nan 0 nan 99 15 3500 2005-12-06 Q4 2005 D3123364665672102D89C63 0.00000 3793.45000 3500.00000 293.45000 -23.12000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 5
103509 0E0F336443449038617E9F4 59 2005-12-21 19:37:33.813 A 36 Completed 2008-12-30 nan 0.09450 0.08450 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False F63E33645133664151DBF61 2005-12-20 20:22:52.890 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.07015 Not displayed True 12500.00000 11463365963100969351D1D nan nan nan nan nan nan nan nan 0 nan 99 22 8998 2005-12-30 Q4 2005 6EC63364337535625235677 288.02000 10295.97000 8998.00000 1297.97000 -69.17000 0.00000 0.00000 0.00000 0.00000 1.01100 0 0 0.00000 14
105925 B01233643651103736EE3B6 20 2005-11-18 14:22:34.763 AA 36 Completed 2005-12-28 nan 0.08500 0.07500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False F0B53365823807576457B84 2005-11-18 14:22:34.763 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.02743 Not displayed True 6666.66667 EB8E3365902019869B4F403 nan nan nan nan nan nan nan nan 0 nan 100 11 5000 2005-11-28 Q4 2005 CAF83365574931485C12441 157.84000 5003.60000 5000.00000 3.60000 -2.38000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 6
106465 B8D63366141679436131B1C 23 2005-11-20 13:21:31.993 NC 36 Completed 2006-01-06 nan 0.09000 0.08500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False 12D7336581480170815332C 2005-11-09 12:01:07.900 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.03826 Not displayed True 4166.66667 3FCC336526459619860F893 nan nan nan nan nan nan nan nan 0 nan 100 7 4800 2005-11-25 Q4 2005 ED983364564478611B1273E 152.64000 4849.36000 4800.00000 49.36000 -4.00000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1
106735 97B333660731451030726AF 5 2005-11-11 08:46:17.303 AA 36 Completed 2006-02-07 nan 0.05950 0.05450 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False NaN 2005-11-11 08:46:17.303 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01747 Not displayed True 14583.33333 5FEA3365823284105CDF615 nan nan nan nan nan nan nan nan 0 nan 100 13 1500 2005-11-30 Q4 2005 BE4F33652072224311E37A5 45.60000 1516.58000 1500.00000 16.58000 -0.84000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 5
109027 F59D33647178242776CD53B 40 2005-11-29 20:23:22.770 AA 36 Completed 2006-01-06 nan 0.06000 0.05500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False B8143364846229046768A83 2005-11-28 15:33:37.547 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.02217 Not displayed True 25000.00000 AC4B33651299629964273AE nan nan nan nan nan nan nan nan 0 nan 99 21 3000 2005-12-15 Q4 2005 6FB733664286679602C721D 0.00000 3010.85000 3000.00000 10.85000 -1.24000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 8
111607 84CC3365208034255DDCB47 35 2005-11-28 17:25:43.160 B 36 Completed 2005-12-08 nan 0.06500 0.06000 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False B8143364846229046768A83 2005-11-28 17:25:43.160 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.01404 Not displayed True 16666.66667 31AC3364816494648054FCB nan nan nan nan nan nan nan nan 0 nan 99 17 3000 2005-12-07 Q4 2005 B17A3365017647637E14A1C 91.95000 3000.53000 3000.00000 0.53000 -0.53000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 5
113921 D21A33647075665665BE266 4 2005-11-09 20:44:28.847 AA 36 Completed 2005-11-25 nan 0.04000 0.03500 nan nan nan nan NaN nan 0 NaN NaN NaN nan False False B8143364846229046768A83 2005-11-09 20:44:28.847 nan nan NaT nan nan nan 0 0.00000 nan nan nan nan nan nan nan nan nan nan nan nan nan 0.23284 Not displayed True 12500.00000 30FD3365652573455326F15 nan nan nan nan nan nan nan nan 0 nan 100 1 1000 2005-11-15 Q4 2005 D3123364665672102D89C63 29.52000 1000.89000 1000.00000 0.89000 -0.88000 0.00000 0.00000 0.00000 0.00000 1.00000 0 0 0.00000 1

What is the structure of your dataset?

There are 113,937 observations for prosper loans with 81 features most features are numeric in type (61) float64(50), int64(11), bool(3), datetime64[ns](5), object(12)

What is/are the main feature(s) of interest in your dataset?

Mainly interested in 2 features BorrowerAPR and LoanStatus to help me answer the following questions:

  • What factors affect a loan’s outcome status?
  • What affects the borrower’s APR or interest rate?
  • Are there differences between loans depending on how large the original loan amount was?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Variable Description
ListingCreationDate The date the listing was created
Term The length of the loan expressed in months
LoanStatus The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket
BorrowerAPR The Borrower's Annual Percentage Rate (APR) for the loan
ProsperScore A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009
ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
BorrowerState The two letter abbreviation of the state of the address of the borrower at the time the Listing was created
Occupation The Occupation selected by the Borrower at the time they created the listing
EmploymentStatus The employment status of the borrower at the time they posted the listing
EmploymentStatusDuration The length in months of the employment status at the time the listing was created
IsBorrowerHomeowner A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner
CreditScoreRangeLower The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency
CreditScoreRangeUpper The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency
IncomeRange The income range of the borrower at the time the listing was created
LoanCurrentDaysDelinquent The number of days delinquent
LoanOriginalAmount The origination amount of the loan

Before I close this section let's do some data cleaning

  • Chose a subset of the data to work with to include just the columns of intrest.
  • Rename ListingCategory (numeric) column to ListingCategory.
  • Replace the numbers of ListingCategory to Categories.
  • Convert the categorical columns into categorical dtype .
  • Drop rows with missing APR values.
  • Drop data from 2005 and 2014 as they are not complete.
  • Create new column for average Credit Score Range and drop the 2 columns for lower and upper bounds.
  • Merge all the past dates for LoanStates into (Past Due).
In [18]:
# List of columns to use with the features of intrest
use_cols = ['ListingCreationDate',
            'Term',
            'LoanStatus', 
            'BorrowerAPR',
            'ProsperScore', 
            'ListingCategory (numeric)', 
            'BorrowerState', 
            'Occupation', 
            'EmploymentStatus', 
            'EmploymentStatusDuration', 
            'IsBorrowerHomeowner', 
            'CreditScoreRangeLower', 
            'CreditScoreRangeUpper', 
            'IncomeRange', 
            'LoanCurrentDaysDelinquent', 
            'LoanOriginalAmount']
In [19]:
# Copy of the data frame to clean it
loan_clean = loan.copy()

# Use only the columns of intrest
loan_clean = loan_clean[use_cols]

# drop the data of 2005 and 2014
loan_clean = loan_clean.loc[~loan_clean.ListingCreationDate.dt.year.isin([2005, 2014])].reset_index().drop('index', axis=1)
In [20]:
# Drop rows with no APR values
loan_clean = loan_clean.loc[~loan_clean['BorrowerAPR'].isnull()].reset_index().drop('index', axis=1)
In [21]:
# Rename the ProsperRating (Alpha) and ListingCategory (numeric) columns
loan_clean = loan_clean.rename(columns={'ListingCategory (numeric)': 'ListingCategory'})
In [22]:
# Replace the numbers of ListingCategory to Categories
listing_category_dict = {0: 'Not Available',
                         1: 'Debt Consolidation', 
                         2: 'Home Improvement', 
                         3: 'Business', 
                         4: 'Personal Loan', 
                         5: 'Student Use', 
                         6: 'Auto',
                         7: 'Other', 
                         8: 'Baby&Adoption', 
                         9: 'Boat', 
                         10: 'Cosmetic Procedure', 
                         11: 'Engagement Ring', 
                         12: 'Green Loans', 
                         13: 'Household Expenses', 
                         14: 'Large Purchases', 
                         15: 'Medical/Dental', 
                         16: 'Motorcycle', 
                         17: 'RV', 
                         18: 'Taxes', 
                         19: 'Vacation',
                         20: 'Wedding Loans'}

loan_clean.ListingCategory.replace(listing_category_dict, inplace=True)
In [23]:
li= ['$100,000+', '$75,000-99,999', '$50,000-74,999', '$25,000-49,999', 
                            '$1-24,999', '$0', 'Not displayed', 'Not employed']
li.reverse()
li
Out[23]:
['Not employed',
 'Not displayed',
 '$0',
 '$1-24,999',
 '$25,000-49,999',
 '$50,000-74,999',
 '$75,000-99,999',
 '$100,000+']
In [24]:
# Dictionary of ordered categories
cat_dict = {'IncomeRange': ['Not employed', 'Not displayed', '$0', '$1-24,999',
                            '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+'],
            'ProsperScore': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
            'Term': [12, 36, 60]}

# List of columns to convert
cat_cols = ['EmploymentStatus', 'LoanStatus', 'ListingCategory', 'BorrowerState', 'Occupation']
In [25]:
# Converting ordered categories columns 

for col in cat_dict:
    ordered_cat = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = cat_dict[col])
    loan_clean[col] = loan_clean[col].astype(ordered_cat)
In [26]:
# Converting other categorical clomuns

for col in cat_cols:
    loan_clean[col] = loan_clean[col].astype('category')
In [27]:
loan_clean['AvgCreditScore'] = (loan_clean.CreditScoreRangeLower + loan_clean.CreditScoreRangeUpper) /2
loan_clean.drop(['CreditScoreRangeLower', 'CreditScoreRangeUpper'], axis=1, inplace=True)
In [28]:
loan_clean.LoanStatus.cat.categories
Out[28]:
Index(['Cancelled', 'Chargedoff', 'Completed', 'Current', 'Defaulted',
       'FinalPaymentInProgress', 'Past Due (1-15 days)',
       'Past Due (16-30 days)', 'Past Due (31-60 days)',
       'Past Due (61-90 days)', 'Past Due (91-120 days)',
       'Past Due (>120 days)'],
      dtype='object')
In [29]:
loan_clean.LoanStatus.replace(to_replace=['Past Due (1-15 days)',
                                          'Past Due (16-30 days)',
                                          'Past Due (31-60 days)',
                                          'Past Due (61-90 days)',
                                          'Past Due (91-120 days)',
                                          'Past Due (>120 days)'],
                             value='Past Due', inplace=True)
In [30]:
loan_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103178 entries, 0 to 103177
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   ListingCreationDate        103178 non-null  datetime64[ns]
 1   Term                       103178 non-null  category      
 2   LoanStatus                 103178 non-null  category      
 3   BorrowerAPR                103178 non-null  float64       
 4   ProsperScore               74119 non-null   category      
 5   ListingCategory            103178 non-null  category      
 6   BorrowerState              97688 non-null   category      
 7   Occupation                 100137 non-null  category      
 8   EmploymentStatus           100948 non-null  category      
 9   EmploymentStatusDuration   95578 non-null   float64       
 10  IsBorrowerHomeowner        103178 non-null  bool          
 11  IncomeRange                103178 non-null  category      
 12  LoanCurrentDaysDelinquent  103178 non-null  int64         
 13  LoanOriginalAmount         103178 non-null  int64         
 14  AvgCreditScore             102612 non-null  float64       
dtypes: bool(1), category(8), datetime64[ns](1), float64(3), int64(2)
memory usage: 5.6 MB
In [31]:
# Save the Cleaned Dataset into csv file
loan_clean.to_csv('data/cleaned_prosper.csv', index=False)

Univariate Exploration

I'll start by looking at the distribution of the main variables of interest: BorrowerApr and LoanStatus.

In [32]:
# Ploting the distribution of BorrowerAPR
fig, ax = plt.subplots(figsize=(16, 6), ncols=2)
sns.histplot(loan_clean.BorrowerAPR, kde=False, color=base_color, binwidth=0.01, ax=ax[0])
sns.despine(left=True);

# Zooming from 0.05 to 0.4
sns.histplot(loan_clean.BorrowerAPR, kde=False, color=base_color, binwidth=0.01, ax=ax[1], binrange=[0.05, 0.4])
sns.despine(left=True);
In [33]:
loan_clean.query('BorrowerAPR < 0.01 | BorrowerAPR > 0.45').sort_values('BorrowerAPR')
Out[33]:
ListingCreationDate Term LoanStatus BorrowerAPR ProsperScore ListingCategory BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner IncomeRange LoanCurrentDaysDelinquent LoanOriginalAmount AvgCreditScore
69559 2007-03-24 10:03:30.717 36 Completed 0.00653 NaN Not Available MI Food Service Management Full-time 152.00000 False $1-24,999 0 3000 669.50000
70958 2006-10-18 06:15:54.327 36 Completed 0.00653 NaN Not Available NJ Other Not available nan False Not displayed 0 5000 609.50000
48447 2006-10-12 06:03:03.183 36 Defaulted 0.00864 NaN Not Available PA Clerical Not available nan False Not displayed 194 2600 489.50000
16597 2006-03-11 15:43:45.393 36 Defaulted 0.45857 NaN Not Available NaN NaN NaN nan False Not displayed 257 3000 nan
94136 2006-03-06 22:36:53.753 36 Defaulted 0.45857 NaN Not Available NaN NaN NaN nan False Not displayed 193 2800 nan
20084 2006-03-12 13:44:15.060 36 Chargedoff 0.46201 NaN Not Available NaN NaN NaN nan False Not displayed 1970 1789 nan
74263 2006-03-02 19:00:17.593 36 Defaulted 0.48873 NaN Not Available NaN NaN NaN nan False Not displayed 225 3000 nan
51369 2006-03-16 19:30:16.753 36 Defaulted 0.50633 NaN Not Available NaN NaN NaN nan False Not displayed 211 3000 nan
32627 2006-02-23 13:57:02.087 36 Completed 0.51229 NaN Not Available NaN NaN NaN nan False Not displayed 0 1500 nan
In [34]:
loan_clean.BorrowerAPR.describe()
Out[34]:
count   103178.00000
mean         0.22260
std          0.08137
min          0.00653
25%          0.15803
50%          0.21470
75%          0.28851
max          0.51229
Name: BorrowerAPR, dtype: float64

The distribution of BorrowerAPR is approximatly normal with a mean of 22.26%, median of 21.47%, minmun rate of 0.65% and maximum of 51.23%. and most of the values ranges between 10% and 45%.

In [35]:
# Ploting the distribution of LoanStatus
LoanStatus_order=loan_clean["LoanStatus"].value_counts().index
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=loan_clean, x='LoanStatus', color=base_color, ax=ax, order=LoanStatus_order)
sns.despine(left=True);

LoanStatus: The most status is Current followd by completed, Chargedoff and Defaulted and the rest is less than 400 and Canceled loans are almost 0.

Now Let's take a look for other categorical features distributions

In [36]:
# Listing the columns to plot
cols = ['ProsperScore', 'Term', 'EmploymentStatus', 'IncomeRange']

# Create 5 rows subplots
fig, ax = plt.subplots(figsize=(10, 20), nrows=4)

# Plotting loop
for i, col in enumerate(cols):
    # If the column dtype is ordered category use it`s order if not order by highest value
    if loan_clean[col].cat.ordered:
        order = None
    else:
        order = loan_clean[col].value_counts().index
    
    # Ploting
    sns.countplot(data=loan_clean, x=col, order=order,color=base_color, ax=ax[i])
    sns.despine(left=True)
    
    # If the number of unique values more the 7 turn switch the axis to prevent the ticks from overlapping
    if col == 'IncomeRange':
        plt.xticks(rotation=90)

The prosper rating and score distribution is normal, most term used is 36 Months and most of the individuals income range falls between 25,000 and 75,000.

In [37]:
# Listing the columns to plot
cols = ['ListingCategory', 'BorrowerState', 'Occupation']

# Create 3 columns subplots
fig, ax = plt.subplots(figsize=(20, 20), ncols=3)

# Plotting loop
for i, col in enumerate(cols):
    # If the column dtype is ordered category use it`s order if not order by highest value
    if loan_clean[col].cat.ordered:
        order = None
    else:
        order = loan_clean[col].value_counts().index
    
    # Ploting
    sns.countplot(data=loan_clean, y=col, order=order,color=base_color, ax=ax[i])
    sns.despine(left=True)
    
  • The top reason for prosper loans is debt consilidation and there is a lot of people didn't reveal for the reason by chosing nothing.
  • Most of the Borrower was from California followed by Florida and Texas
In [38]:
# Creating plots for listing creation data by year and month
fig, ax = plt.subplots(figsize=(16, 6), ncols=2)
g = sns.countplot(x=loan_clean.ListingCreationDate.dt.year, color=base_color, ax=ax[0])
sns.despine(left=True)
g.set(xlabel='Loan Creation Year');

order = ['January', 'February', 'March', 'April', 'May', 'June', 
         'July', 'August', 'September', 'October', 'November', 'December']
g = sns.countplot(x=loan_clean.ListingCreationDate.dt.month_name(), order=order, color=base_color, ax=ax[1])
sns.despine(left=True)
g.set(xlabel='Loan Creation Month')
plt.xticks(rotation=45);
  • As we can see here the number loans from 2006 was increasing then dropped in 2009 and started to rise again.
  • And people tend to borrow money more in the 2nd half of the year.
In [39]:
# Preparing for pie plot
isHomeOwner = loan_clean.IsBorrowerHomeowner.value_counts()

# Creating a pie chart for IsBorrowerHomeowner colum
plt.pie(isHomeOwner,
        labels=['Home Owner', 'Not a Home Owner'],
        startangle=90,
        counterclock=False,
        wedgeprops = {'width' : 0.5},
        colors=['#00e560', '#e50020'])
plt.axis('square');
  • Nothing intersted here almost 50% of individuals is home owners and the other 50% not.
In [40]:
# Ploting the distribution of EmploymentStatusDuration
binsize = 20
bins = np.arange(0, loan_clean.EmploymentStatusDuration.max()+binsize, binsize)
fig, ax = plt.subplots(figsize=(16, 6), ncols=2)
ax[0].hist(loan_clean.EmploymentStatusDuration, color=base_color, bins=bins)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
ax[0].set_xlabel('Employment Status Duration');

# Ploting the log of EmploymentStatusDuration
log_bins = 10 ** np.arange(0, 3.0+0.1, 0.1)
ticks = [ 1, 3, 10, 30, 100, 300, 1000]
ax[1].hist(loan_clean.EmploymentStatusDuration, color=base_color, bins=log_bins)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].spines['left'].set_visible(False)
ax[1].set_xscale('log')
ax[1].set_xticks(ticks)
ax[1].set_xticklabels(ticks)
ax[1].set_xlabel('Employment Status Duration (log)');
  • Employment Status Duration has a long-tailed distribution with a lot of emplyees between $0 \to 100$ months in their current position
    when plotted the log-scale, the distribution is roughly normal with a peak around $80 \to 90$ months.
In [41]:
# Ploting Credit score ranges lower and upper
binsize = 10
bins = np.arange(0, loan_clean.AvgCreditScore.max()+binsize, binsize)
# Create a 1x2 Plot
fig, ax = plt.subplots(figsize=(16, 5), ncols=2)

# Ploting AvgCreditScore
ax[0].hist(loan_clean.AvgCreditScore, color=base_color, bins=bins)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
ax[0].set_xlabel('Average Credit Score');

# Ploting AvgCreditScore
ax[1].hist(loan_clean.AvgCreditScore, color=base_color, bins=bins)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].spines['left'].set_visible(False)
ax[1].set_xlabel('Average Credit Score')
ax[1].set(xlim=[450, 850]);
  • The distribution of average credit score ranges from about $450 \to 850$ and it's roughly normal with a peak around $670 \text{ and } 690$
In [42]:
# Ploting LoanCurrentDaysDelinquent
# Create a 2x2 Plot
fig, ax = plt.subplots(figsize=(20, 6), ncols=2)

# Ploting LoanCurrentDaysDelinquent 
binsize_delinquent = 50
bins_delinquent = np.arange(0, loan_clean.LoanCurrentDaysDelinquent.max()+binsize_delinquent, binsize_delinquent)
ax[0].hist(loan_clean.LoanCurrentDaysDelinquent, color=base_color, bins=bins_delinquent)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
ax[0].set_xlabel('Loan Current Days Delinquent');

# Ploting LoanCurrentDaysDelinquent Zoomed in range from 1 to 300
binsize_delinquent_zoomed = 5
bins_delinquent_zoomed = np.arange(1, loan_clean.LoanCurrentDaysDelinquent.max()+binsize_delinquent_zoomed, binsize_delinquent_zoomed)
ax[1].hist(loan_clean.LoanCurrentDaysDelinquent, color=base_color, bins=bins_delinquent_zoomed)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].spines['left'].set_visible(False)
ax[1].set_xlabel('Loan Current Days Delinquent')
ax[1].set(xlim=[1, 300]);
  • Loan delinquent days most users pay on time with most of them with 0 delayed days then from $1 \to 300$ is roughly uniform with a peck around 120 days or 4 months and few delayed more than a year and we will go deeper into this and see that those for defaulted loans.
In [43]:
# Ploting Loan Original Amount Distribution
plt.figure(figsize=(10, 5))
sns.histplot(loan_clean.LoanOriginalAmount, color=base_color, kde=False, binwidth=1000)
sns.despine(left=True)
plt.axvline(x=loan_clean.LoanOriginalAmount.mean(), color='#20fa30', linewidth=3) # Show the mean with a green virtical line
plt.axvline(x=loan_clean.LoanOriginalAmount.median(), color='#2030fa', linewidth=3); # Show the median with a blue virtical line
  • Since mean exceeds the median , the distribution is expected to be right skewed, which is obvious from the histogram, other than couple of peaks. Moreover, we notice that it is quite rare for borrowers to ask for large amount of loans through prosper.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

  • I'm interested in 2 variables BorrowerAPR and LoanStatus.
  • TAs mentioned above the distribution of BorrowerAPR is roughly normal with a peak around 35% and didn't to perform any kind of transformations.
  • LoanStatus: The most status is Current followd by completed, Chargedoff and Defaulted and the rest is less than 400 and Canceled loans are almost 0.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • When investigatin number of Loans per year we found a drop in numbers in 2009.
  • The most reason for loans is 'Debt Consilidations', Most of individuals were from California state, and top occupations were (other, professional and computer programmer)

Bivariate Exploration

Let's start by plotting the correlation coeffetient for the numeric variables in our data

In [44]:
# Listing the numeric variables
num_cols = ['BorrowerAPR', 'ProsperScore', 'EmploymentStatusDuration', 
            'AvgCreditScore', 'LoanCurrentDaysDelinquent', 'LoanOriginalAmount']
# Heatmap Plot
plt.figure(figsize=(8, 6))
sns.heatmap(loan_clean[num_cols].corr(),
            annot=True,
            fmt='.3f',
            cmap='YlGnBu',
            center=0,
            linewidths=.5,
            square=True);
In [45]:
# Sample for Pair Grid
loan_samp = loan_clean.sample(1000)
In [46]:
# Pair grid plot for numeric variables
g = sns.PairGrid(data=loan_samp, vars=num_cols)
g = g.map_diag(sns.histplot, kde=False, color=base_color)
g.map_offdiag(sns.scatterplot, color=base_color, alpha=0.4, s=10);
  • based on the scatter plots and the correlation matrix we find that there is a negative relationship between BorrowerAPR and LoanOrignalAmount, we may expect that the bigger the loan amount the less the interest rate.
  • Also a negative relationship between BorrowerAPR and ProsperScore we may expect that the bigger the score the less the interest rate.
  • And finaly another negative relationship between BorrowerAPR and AvgCreditScore we may expect that the bigger average score the less interest rate.
In [47]:
# List the high correlated variables with BorrowerAPR to zoom into their scatter plot
apr_high_corr = ['ProsperScore', 'AvgCreditScore', 'LoanOriginalAmount']

fig, axis = plt.subplots(figsize=(16, 20), nrows=3, ncols=2, sharey=True)
for i, var in enumerate(apr_high_corr):
    sns.regplot(data=loan_clean,
                x=var,
                y='BorrowerAPR',
                color=base_color,
                scatter_kws={'alpha':0.4, 's':5},
                ax=axis[i, 0])
    sns.despine(left=True);

    sns.histplot(x=loan_clean[var].astype('float64'),
                 y=loan_clean.BorrowerAPR,
                 color=base_color,
                 cbar=True,
                 ax=axis[i, 1])
    sns.despine(left=True);
In [48]:
loan_clean.select_dtypes(include=['category', 'bool']).describe()
Out[48]:
Term LoanStatus ProsperScore ListingCategory BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner IncomeRange
count 103178 103178 74119 103178 97688 100137 100948 103178 103178
unique 3 7 11 21 51 67 8 2 8
top 36 Current 6 Debt Consolidation CA Other Employed True $25,000-49,999
freq 80390 45933 10912 49860 13294 26302 58024 51803 29482
In [49]:
loan_clean.select_dtypes(exclude=['category', 'bool']).describe(datetime_is_numeric=True)
Out[49]:
ListingCreationDate BorrowerAPR EmploymentStatusDuration LoanCurrentDaysDelinquent LoanOriginalAmount AvgCreditScore
count 103178 103178.00000 95578.00000 103178.00000 103178.00000 102612.00000
mean 2011-04-03 07:04:32.053993984 0.22260 94.73928 168.75002 7965.84034 694.23512
min 2006-01-06 15:21:19.047000 0.00653 0.00000 0.00000 1000.00000 9.50000
25% 2008-07-11 06:58:03.450999808 0.15803 25.00000 0.00000 3500.00000 669.50000
50% 2012-03-08 06:03:38.520000 0.21470 66.00000 0.00000 6000.00000 689.50000
75% 2013-06-13 09:04:30.018000128 0.28851 135.00000 0.00000 10195.00000 729.50000
max 2013-12-31 22:28:19.927000 0.51229 755.00000 2704.00000 35000.00000 889.50000
std NaN 0.08137 93.50315 487.27972 6075.00870 68.76704
In [50]:
# Listing the categoric variables
cat_cols = ['Term', 'LoanStatus', 'ProsperScore',
            'EmploymentStatus', 'IsBorrowerHomeowner', 'IncomeRange']
In [51]:
# Creating 3x2 plot for the vars with unique values less than 15
fig1, ax = plt.subplots(figsize=(20, 20), ncols=2, nrows=3, constrained_layout=True) # Constrained_layout to avoid plots overlapping
# Flatten the axis into 1d numpy array
ax = ax.flatten()

# Plotting loop
for i, var in enumerate(cat_cols):
    sns.boxplot(data=loan_clean, x=var, y='BorrowerAPR', color='#33b5e5', ax=ax[i])
    sns.despine(left=True, ax=ax[i])
    if var in ['LoanStatus', 'EmploymentStatus', 'IncomeRange']:
            plt.setp(ax[i].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
In [52]:
# plotting the distribution for BorrowerAPR by ListingCategory
order = loan_clean.ListingCategory.value_counts().index
g = sns.FacetGrid(data = loan_clean, 
                  col='ListingCategory',
                  col_wrap=5,
                  col_order=order,
                  sharey=False,
                  height=2,
                  aspect=1.5)

g.map(sns.histplot,
      'BorrowerAPR',
      kde=False, 
      color=base_color)
      
g.set_titles('{col_name}');
In [53]:
# plotting the distribution for BorrowerAPR by BorrowerState
order = loan_clean.BorrowerState.value_counts().index
g = sns.FacetGrid(data = loan_clean, 
                  col='BorrowerState',
                  col_wrap=6,
                  col_order=order,
                  sharey=False,
                  height=2,
                  aspect=1.5)

g.map(sns.histplot,
      'BorrowerAPR',
      kde=False, 
      color=base_color)
      
g.set_titles('{col_name}');
In [54]:
# plotting the distribution for BorrowerAPR by Occupation
order = loan_clean.Occupation.value_counts().index
g = sns.FacetGrid(data = loan_clean, 
                  col='Occupation',
                  col_wrap=7,
                  col_order=order,
                  sharey=False,
                  height=2,
                  aspect=1.5)

g.map(sns.histplot,
      'BorrowerAPR',
      kde=False, 
      color=base_color)
      
g.set_titles('{col_name}');

based on the boxplots we can see that:

  • there is no clear relation between BorrowerAPR and the payment Term.
  • the lowest median for BorrowerAPR is for Completed followed by Current LoanStatus.
  • the negative relation between BorrowerAPR and ProsperScore is appearing here again.
  • there is a relation between BorrowerAPR and EmploymentStatus as it tends to be lower for emplyed individuals than not employed with some cosiderations for retired and part time jobs to be the lowest.
  • the BorrowerAPR tend to be lower for Homeowners than if the individual doesn't own a house.
  • again we see that the BorrowerAPR tend to be high for not emploed individuals in IncomeRange with some considerations for people with 0 income.
  • and for the other relation between BorrowerAPR and ListingCategory some reasons tend to get lower rate than others but we will dig more in multivariate exploration to see if this effect because of the ListingCategory or there is another correlated variable that make this effect.
  • and about the relation between BorrowerAPR and the BorrowerState and Occupation are not clear.
In [55]:
# Listing the columns to plot
cols = ['LoanStatus', 'Term', 'EmploymentStatus', 'IsBorrowerHomeowner', 'IncomeRange']

# Create 5 rows subplots
fig, ax = plt.subplots(figsize=(16, 20), nrows=5, constrained_layout=True)

# Plotting loop
for i, col in enumerate(cols):
    # If the column dtype is ordered category use it`s order if not order by highest value
    try:
        if loan_clean[col].cat.ordered:
            order = None
        else:
            order = loan_clean[col].value_counts().index
    except:
        order = None
        pass
    
    # Ploting
    sns.countplot(data=loan_clean, x=col, order=order, hue='ProsperScore', palette='Blues', ax=ax[i])
    sns.despine(left=True)
    ax[i].legend(loc=1, ncol=2, framealpha=.7)
    # Prevent the ticks from overlapping for IncomeRange and LoanStatus
    if col in ['IncomeRange', 'LoanStatus']:
        plt.setp(ax[i].get_xticklabels(), rotation=30, ha="right", rotation_mode="anchor")
  • Nothing clear here the distribution for ProsperScore is roughly the same for other categories.
In [56]:
# Ploting countplot for LoanStatus by IncomeRange
# Create 5 rows subplots
fig, ax = plt.subplots(figsize=(16, 5))

# Ploting
sns.countplot(data=loan_clean, x='LoanStatus', hue='IncomeRange', palette='RdBu', ax=ax)
sns.despine(left=True)
ax.legend(loc=1, ncol=2, framealpha=.7);
plt.xticks(rotation=90);
In [57]:
# LoanStatus vs IsBorrowerHomeOwner
plt.figure(figsize=(16, 5))

sns.countplot(data=loan_clean,
              x='LoanStatus',
              hue='IsBorrowerHomeowner',
              palette=['#e50020', '#00e560'])
sns.despine(left=True)
plt.xticks(rotation=90);
In [58]:
# Preparing for HeatMap plot 
df_loan_income = loan_clean.groupby(['LoanStatus', 'IncomeRange']).count()['ListingCreationDate'].reset_index()
df_loan_income.columns = ['LoanStatus', 'IncomeRange', 'Count']
df_loan_income = df_loan_income.pivot(index = 'LoanStatus', columns = 'IncomeRange', values = 'Count').fillna(0)
In [59]:
# Heatmap Plot
plt.figure(figsize=(12, 6))
sns.heatmap(df_loan_income,
            annot=True,
            fmt=',.0f',
            cmap='rocket_r',
            mask=(df_loan_income<0.5),
            linewidths=.5,
            square=True);
  • As we can see here the distribution of IncomeRange vary accross the LoanStatus
  • 0 Not displayed IncomeRange for current loans which means that the policy of the website changed lately.
  • Most of Defaulted loans were for individuals who didn't displayed their income.
In [60]:
# Preparing for HeatMap plot 
df_loan_employment = loan_clean.groupby(['LoanStatus', 'EmploymentStatus']).count()['ListingCreationDate'].reset_index()
df_loan_employment.columns = ['LoanStatus', 'EmploymentStatus', 'Count']
df_loan_employment = df_loan_employment.pivot(index = 'LoanStatus', columns = 'EmploymentStatus', values = 'Count').fillna(0)
In [61]:
# Heatmap Plot
plt.figure(figsize=(12, 6))
sns.heatmap(df_loan_employment,
            annot=True,
            fmt=',.0f',
            cmap='rocket_r',
            mask=(df_loan_employment<0.5),
            linewidths=.5,
            square=True);
  • As we can see here the distribution of EmploymentStatus vary accross the LoanStatus
  • 0 Not available EmploymentStatus for current loans which means that the policy of the website changed lately.
  • Most of the current loans are for employed individuals.
  • Most of Defaulted loans were for individuals with full time job.
In [62]:
# Creating a df for map plotting
df_map = loan_clean[['BorrowerState', 'LoanOriginalAmount', 'BorrowerAPR']].groupby('BorrowerState').agg({'BorrowerAPR':'mean', 'LoanOriginalAmount':'sum'}).reset_index()
df_map.columns = ['state', 'apr', 'amount']

df_map.head()
Out[62]:
state apr amount
0 AK 0.21635 1531721
1 AL 0.24288 11150808
2 AR 0.24234 5334223
3 AZ 0.22074 13403979
4 CA 0.22190 114588890
In [63]:
# Setting the figure for plotly map
fig = go.Figure(data=go.Choropleth(locations=df_map.state, # Spatial coordinates
                                   z = df_map.apr.astype(float) * 100 , # Data to be color-coded
                                   locationmode = 'USA-states', # set of locations match entries in `locations`
                                   text=df_map.state,
                                   colorscale = 'RdBu',
                                   autocolorscale=False,
                                   marker_line_color='darkgray',
                                   marker_line_width=0.5,
                                   colorbar_title = "APR %",
                                   colorbar_tickprefix = '%',
                                   hovertemplate = '<i>APR</i>: %{z:.2f}<extra>%{text}</extra>', # Hoovering Text
                                   showlegend=False))


# Setting the layout
fig.update_layout(
    hoverlabel_align = 'right',
    title_text = 'Average BorrowerAPR by State',
    geo_scope='usa', # limite map scope to USA
)
In [64]:
# Setting the figure for plotly map
fig = go.Figure(data=go.Choropleth(locations=df_map.state, # Spatial coordinates
                                   z = df_map.amount.astype(float), # Data to be color-coded
                                   locationmode = 'USA-states', # set of locations match entries in `locations`
                                   text=df_map.state,
                                   colorscale = 'Blues',
                                   autocolorscale=False,
                                   marker_line_color='darkgray',
                                   marker_line_width=0.5,
                                   colorbar_title = "USD",
                                   colorbar_tickprefix = '$',
                                   hovertemplate = '<i>Amount</i>: $%{z:,.0f}<extra>%{text}</extra>', # Hoovering Text
                                   showlegend=False))


# Setting the layout
fig.update_layout(
    hoverlabel_align = 'right',
    title_text = 'Total original loan amount by State',
    geo_scope='usa', # limite map scope to USA
)

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • LoanOriginalAmount, ProsperScore and AvgCreditScore are the most features that affect the BorrowerAPR as the larger they are the lower the interest rate is.
  • LoanStatus is effected the most by IncomeRange and EmploymentStatus as most of defaulted loans are for people who were unemployed.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • AvgCreditScore and LoanOriginalAmount is positively correlated we can expect that the higher score the higher the loan amount the individual can borrow.
  • The highest total LoanOriginalAmount was for individuals who are located in California state.

Multivariate Exploration

In [65]:
def hist2dgrid(x, y, **kwargs):
    """ Quick hack for creating heat maps with seaborn's PairGrid. """
    palette = kwargs.pop('color')
    bins_x = np.arange(0, 12, 1)
    bins_y = np.arange(0, 0.6, 0.1)
    plt.hist2d(x, y, bins = [bins_x, bins_y], cmap = palette, cmin = 0.5)
In [66]:
df_hist2d = loan_clean.query('EmploymentStatus != "Not available"').copy()
In [67]:
# Ploting heatmap for ProsperScore and BorrowerAPR by EmploymentStatus
g = sns.FacetGrid(data = df_hist2d,
                 col='EmploymentStatus',
                 col_wrap = 4,
                 height = 3)
g.map(hist2dgrid, 'ProsperScore', 'BorrowerAPR', color = 'inferno_r')
g.set_titles('{col_name}')
g.despine(left=True);
In [68]:
# Ploting heatmap for ProsperScore and BorrowerAPR by IsBorrowerHomeowner
g = sns.FacetGrid(data = df_hist2d,
                 col='IsBorrowerHomeowner',
                 col_wrap = 4,
                 height = 3)
g.map(hist2dgrid, 'ProsperScore', 'BorrowerAPR', color = 'inferno_r')
g.set_titles('{col_name}')
g.despine(left=True);
In [69]:
# Ploting heatmap for ProsperScore and BorrowerAPR by LoanStatus
g = sns.FacetGrid(data = df_hist2d,
                 col='LoanStatus',
                 col_wrap = 4,
                 height = 3)
g.map(hist2dgrid, 'ProsperScore', 'BorrowerAPR', color = 'inferno_r')
g.set_titles('{col_name}')
g.despine(left=True);
<__array_function__ internals>:5: UserWarning:

Warning: converting a masked element to nan.

C:\Users\aghar\anaconda3\envs\dapnd\lib\site-packages\numpy\core\_asarray.py:83: UserWarning:

Warning: converting a masked element to nan.

In [78]:
# Plotting the relation between ProsperScore, BorrowerAPR and EmployementStatus
fig, ax = plt.subplots(figsize=(20, 8))
sns.pointplot(data=loan_clean,
               x='LoanStatus',
               y='BorrowerAPR',
               hue='EmploymentStatus',
               linestyles="",
               dodge=0.5,
               palette='inferno_r')
sns.despine(left=True)
plt.legend(loc=2, ncol=8, framealpha=0.7)
plt.show();
In [71]:
# Creating a df for map plotting
df_scatter = loan_clean.copy()
df_scatter['year'] = df_scatter.ListingCreationDate.dt.year
df_scatter = df_scatter[['BorrowerState',
                         'year',
                         'LoanOriginalAmount',
                         'BorrowerAPR',
                         'ListingCreationDate']].groupby(['BorrowerState', 'year']).agg({'BorrowerAPR':'mean',
                                                                                         'LoanOriginalAmount':'mean',
                                                                                         'ListingCreationDate':'count'}).reset_index()
df_scatter.columns = ['state', 'year', 'AverageAPR', 'AverageLoanAmount', 'count']

# Setting the text for hovering over the state
df_scatter.head()
Out[71]:
state year AverageAPR AverageLoanAmount count
0 AK 2006 0.09909 2500.00000 2
1 AK 2007 0.12979 7476.92308 13
2 AK 2008 0.16638 5676.33333 18
3 AK 2009 0.19948 8500.00000 2
4 AK 2010 0.28769 5245.45455 11
In [72]:
# Plotting animated scatter
fig = px.scatter(df_scatter,
                 x="AverageLoanAmount",
                 y="AverageAPR",
                 animation_frame="year",
                 animation_group="state",
                 size="count",
                 color="state",
                 size_max=50)


fig.layout.update(title_text="Average BorrowerAPR vs Average Orignal Amount by State (2006-2013)",
                  title_font_size=14,
                  showlegend=True,
                  transition= {'duration':20000000})

fig.show()

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Relation between EmploymentStatus, ProsperScore and BorrowerAPR

  • Not employed idividuals have the highest APR for current loans suggesting that now Prosper company consider them the highest risk and the lowest rate for employed individuals.

Were there any interesting or surprising interactions between features?

  • Average Loan amount was increasing from 2006 to 2008 until it dropped in 2009 regards then it rise again and containue increasing while the average BorrowerAPR was decreasing.
In [ ]: